In this report, we shall continue the analysis of a brain stroke dataset by computing variable attributions for the predictors, based on the predictive models we have fit previously, to see which of them matter most. To this end, we will use the shap and dalex libraries to compute SHapley Additive exPlanations (SHAP values).
The agenda is as follows:
Shapley values comprise a game-theoretical concept for assigning importance to individual players in cooperative games. A cooperative game consists of a set of players $P$ and a "payoff function" $v: 2^P \to \mathbb{R}$ which assigns a payoff to each coalition, a coalition being a subset of players. One measure of the significance of a single player $p \in P$ could be the surplus $v(S \cup \{p\}) - v(S)$ in the payoff when $p$ joins some coalition $S$ with $p \not\in S$. Shapley values represent an average of this surplus across all coalitions and across all ways in which a particular coalition may arise when players form it one by one. For more detailed information, including the formulae for computing Shapley values, one may read the Wikipedia page.
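For reference, the standard closed form averages $p$'s marginal contribution over all coalitions $S$ not containing $p$, weighted by the number of join orders that realize each coalition: $$ \phi_p(v) = \sum_{S \subseteq P \setminus \{p\}} \frac{|S|!\,(|P| - |S| - 1)!}{|P|!} \left[ v(S \cup \{p\}) - v(S) \right] $$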
To gain some further intuition about the Shapley values, let's compute them for a simple example of a three-player coalitional game. Let the payoff function $v$ for this game be: $$ \begin{align*} v() &= 0\\ v(A) &= 20\\ v(B) &= 20\\ v(C) &= 60\\ v(A, B) &= 60\\ v(A, C) &= 70\\ v(B, C) &= 70\\ v(A, B, C) &= 100 \end{align*} $$
Then, we can compute the Shapley value $\phi_A(v)$ in the following fashion: $$ \begin{align*} \phi_A(v) &= \frac{1}{\text{\# of orders of players}} \sum_{\text{orders of players}} (v(\text{$A$ and players preceding $A$}) - v(\text{players preceding $A$}))\\ &= \frac{1}{6}(2[v(A) - v()] + [v(B, A) - v(B)] + [v(B, C, A) - v(B, C)]\\ &+ [v(C, A) - v(C)] + [v(C, B, A) - v(C, B)])\\ &= 25 \end{align*}$$ where the factor of $2$ accounts for the two orders ($A, B, C$ and $A, C, B$) in which $A$ joins first. We can interpret this as $A$ bringing, on average, a $25$ surplus to a randomly formed coalition.
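We can verify this by brute force. Here is a minimal sketch (the function name `shapley` is ours, purely for illustration) that enumerates all join orders and averages each player's marginal contribution:

```python
from itertools import permutations

# Payoff function from the example above, keyed by frozensets of players.
v = {
    frozenset(): 0,
    frozenset("A"): 20, frozenset("B"): 20, frozenset("C"): 60,
    frozenset("AB"): 60, frozenset("AC"): 70, frozenset("BC"): 70,
    frozenset("ABC"): 100,
}

def shapley(player, players, payoff):
    """Average marginal contribution of `player` over all join orders."""
    orders = list(permutations(players))
    total = 0
    for order in orders:
        preceding = frozenset(order[:order.index(player)])
        total += payoff[preceding | {player}] - payoff[preceding]
    return total / len(orders)

print([shapley(p, "ABC", v) for p in "ABC"])  # [25.0, 25.0, 50.0]
```

Note that the three values sum to $v(A, B, C) = 100$: the Shapley values distribute the grand coalition's payoff exactly among the players (the "efficiency" property).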
Shapley values provide an attractive framework for dealing with feature importances. If we view a predictive model as a cooperative game between the predictor variables, with the "payoff" being the model's prediction, then the $\phi$'s enjoy many desirable properties: for example, $\phi_x = 0$ whenever feature $x$ is irrelevant to the prediction, and the attributions sum to the difference between the prediction and the baseline (average) prediction. However, due to the combinatorial definition of the Shapley values, they can be prohibitively expensive to compute exactly, which gives rise to efficient model-type-specific algorithms, such as TreeSHAP for tree ensembles.
Having gained an understanding of what the SHAP values are, let's now see how they can be used for explaining the predictions of our models for predicting brain stroke. We will use two libraries:
For now, though, we will use the shap library. It has a number of plotting options for visualizing the attributions of the variables. To get a high-level overview, we can use the summary plot:
Here, we can see how the different predictor variables affect the model output. For example, if we look at the hypertension variable, we can see that when its value is high, as indicated by the red color, its SHAP value also tends to be high, increasing the predicted chance of stroke. Likewise, a low value of age (blue color) is associated with a lower-than-average predicted chance of a brain stroke. Features such as gender_Male or Residence_type_Urban do not seem to have any noticeable effect on the prediction.
We can investigate the impact of a single variable with a dependence plot. Let us plot it for age.
Here we can clearly see that, as the age increases, so does the SHAP value. For example, one conclusion we can draw is that up until the age of around 50 the chance of getting a brain stroke is lower by around 10-20 percentage points (compared to the baseline of 30% for our model), whereas from about the age of 70 the chance is higher by about 30 percentage points.
Another dimension along which we can analyze the results in greater detail is to look at the variable attributions for a single observation. Let's therefore pick some random observation:
We will take a look at the results of our model for a married 36-year-old woman with a fairly normal BMI of 23.2 - we might expect the predicted probability to be very low.
And it is indeed quite low, around 8%. From this diagram we can derive a number of other useful observations:
Not being married seems to consistently reduce the predicted chance by "a fair bit", whereas being married increases it slightly. (As for why that might be the case, I don't know, but I thought it interesting.)
As we have seen from the plots, variables might have positive or negative attributions, depending on their values. Let's, for example, take a look at two records:
The variable attributions for them are:
And indeed, in the first case age, hypertension and bmi, among others, have negative attributions, their values indicating a reduced chance of a stroke, whereas in the second case all of these variables point to an increased chance of an attack.
Another phenomenon we can observe is different observations having different variables which best explain the predicted outcome. For example, for:
we get:
Or, in other words, for the first subject the variables of greatest import in explaining the result are age and hypertension, whereas for the second bmi and ever_married_Yes play a greater role - notably, for the second person age has a fairly small impact on the prediction, even though in almost all previously studied cases we've seen it at or near the top of the list.
SHAP values grant us an understanding of how a predictive model arrives at its conclusions. Naturally, we can also use them to compare different types of models - this may help us, for example, decide which one we ought to use, beyond such standard criteria as accuracy, precision or recall. To be specific, we will now look at the predictions and SHAP values for a logistic regression model.
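Since a logistic regression model (and hence its SHAP values) lives on the log-odds scale, it helps to keep the sigmoid conversion in mind; a minimal sketch:

```python
import math

def sigmoid(z):
    """Map log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))   # 0.5 -- log odds of 0 means a 50% chance
print(sigmoid(-2.44)) # roughly 0.08, i.e. about an 8% chance
```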
One important thing to note is the difference in units between the tree model and the logistic regression model - the latter outputs log odds, which we can convert to raw probabilities with the sigmoid function. Notwithstanding, it's pretty clear that the behavior of the model is wildly different, at least for these observations. For example, the SHAP values for the logistic regression seem to balance out the dummy variables (like smoking_status_formerly_smoked) which we introduced to encode the categorical features in a form the model can use. Another difference is that it does not seem to put as much weight on the BMI value as the tree model - evidenced by bmi not showing up on either of the plots - even though we would expect a predictive model to take it into account. Let's in fact look at the summary plot for the model:
Glancing at it, it would seem that the dummy variables get assigned a SHAP value depending on the variable's value alone, whereas variables like age have their impact more "spread out". Personally, I found the results for the tree model more "natural" and interpretable, which could be a factor in choosing it over the linear model - as we might recall from Part I, the AUROC scores for these two models were very similar, so if we didn't take into account how these models arrive at their results, we could make a choice that e.g. introduces anomalous results.
To finish off the analysis, we will now compare packages. So far we have focused on the shap package - we will now compare it with dalex to briefly check whether the results are the same, and also to see how one might approach visualizing the results differently.
I would say that (1) the attributions seem broadly similar, although I will note that for the second observation being a former smoker was given a lesser impact in dalex than in shap, among other differences; and (2) the plots (well, this particular type of plot) are fairly similar visually. I suppose now would be the time and the place to recommend one of them, but I haven't explored them to a degree that would allow me to make an informed judgement.